Biostat 203B Homework 2

Due Feb 9 @ 11:59PM

Author

Jiachen Ai, UID: 206182615

Display machine information for reproducibility:

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.3

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.3.1    fastmap_1.1.1     cli_3.6.1        
 [5] tools_4.3.1       htmltools_0.5.7   rstudioapi_0.15.0 yaml_2.3.7       
 [9] rmarkdown_2.25    knitr_1.44        jsonlite_1.8.7    xfun_0.40        
[13] digest_0.6.33     rlang_1.1.1       evaluate_0.22    

Load necessary libraries (you can add more as needed).

library(arrow)
library(data.table)
library(memuse)
library(pryr)
library(R.utils)
library(tidyverse)
library(readr)
library(dplyr)
library(duckdb)
library(DBI)

Display memory information of your computer

memuse::Sys.meminfo()
Totalram:  8.000 GiB 
Freeram:   2.722 GiB 

In this exercise, we explore various tools for ingesting the MIMIC-IV data introduced in homework 1.

Display the contents of MIMIC hosp and icu data folders:

ls -l ~/mimic/hosp/
total 35936568
-rw-rw-r--@ 1 jacenai  staff     15516088 Jan  5  2023 admissions.csv.gz
-rw-rw-r--@ 1 jacenai  staff       427468 Jan  5  2023 d_hcpcs.csv.gz
-rw-rw-r--@ 1 jacenai  staff       859438 Jan  5  2023 d_icd_diagnoses.csv.gz
-rw-rw-r--@ 1 jacenai  staff       578517 Jan  5  2023 d_icd_procedures.csv.gz
-rw-rw-r--@ 1 jacenai  staff        12900 Jan  5  2023 d_labitems.csv.gz
-rw-rw-r--@ 1 jacenai  staff     25070720 Jan  5  2023 diagnoses_icd.csv.gz
-rw-rw-r--@ 1 jacenai  staff      7426955 Jan  5  2023 drgcodes.csv.gz
-rw-rw-r--@ 1 jacenai  staff    508524623 Jan  5  2023 emar.csv.gz
-rw-rw-r--@ 1 jacenai  staff    471096030 Jan  5  2023 emar_detail.csv.gz
-rw-rw-r--@ 1 jacenai  staff      1767138 Jan  5  2023 hcpcsevents.csv.gz
-rw-rw-r--@ 1 jacenai  staff  13730083993 Jan  5  2023 labevents.csv
-rw-rw-r--@ 1 jacenai  staff   1939088924 Jan  5  2023 labevents.csv.gz
drwxr-xr-x@ 3 jacenai  staff           96 Feb  5 16:52 labevents.parquet
-rw-r--r--@ 1 jacenai  staff    133238879 Feb  5 15:28 labevents_filtered.csv.gz
-rw-rw-r--@ 1 jacenai  staff     96698496 Jan  5  2023 microbiologyevents.csv.gz
-rw-rw-r--@ 1 jacenai  staff     36124944 Jan  5  2023 omr.csv.gz
-rw-rw-r--@ 1 jacenai  staff      2312631 Jan  5  2023 patients.csv.gz
-rw-rw-r--@ 1 jacenai  staff    398753125 Jan  5  2023 pharmacy.csv.gz
-rw-rw-r--@ 1 jacenai  staff    498505135 Jan  5  2023 poe.csv.gz
-rw-rw-r--@ 1 jacenai  staff     25477219 Jan  5  2023 poe_detail.csv.gz
-rw-rw-r--@ 1 jacenai  staff    458817415 Jan  5  2023 prescriptions.csv.gz
-rw-rw-r--@ 1 jacenai  staff      6027067 Jan  5  2023 procedures_icd.csv.gz
-rw-rw-r--@ 1 jacenai  staff       122507 Jan  5  2023 provider.csv.gz
-rw-rw-r--@ 1 jacenai  staff      6781247 Jan  5  2023 services.csv.gz
-rw-rw-r--@ 1 jacenai  staff     36158338 Jan  5  2023 transfers.csv.gz
ls -l ~/mimic/icu/
total 65148984
-rw-rw-r--@ 1 jacenai  staff        35893 Jan  5  2023 caregiver.csv.gz
-rw-rw-r--@ 1 jacenai  staff  30204420231 Jan  5  2023 chartevents.csv
-rw-rw-r--@ 1 jacenai  staff   2467761053 Jan  5  2023 chartevents.csv.gz
-rw-rw-r--@ 1 jacenai  staff        57476 Jan  5  2023 d_items.csv.gz
-rw-rw-r--@ 1 jacenai  staff     45721062 Jan  5  2023 datetimeevents.csv.gz
-rw-rw-r--@ 1 jacenai  staff      2614571 Jan  5  2023 icustays.csv.gz
-rw-rw-r--@ 1 jacenai  staff    251962313 Jan  5  2023 ingredientevents.csv.gz
-rw-rw-r--@ 1 jacenai  staff    324218488 Jan  5  2023 inputevents.csv.gz
-rw-rw-r--@ 1 jacenai  staff     38747895 Jan  5  2023 outputevents.csv.gz
-rw-rw-r--@ 1 jacenai  staff     20717852 Jan  5  2023 procedureevents.csv.gz

Q1. read.csv (base R) vs read_csv (tidyverse) vs fread (data.table)

Q1.1 Speed, memory, and data types

There are quite a few utilities in R for reading plain text data files. Let us test the speed of reading a moderate sized compressed csv file, admissions.csv.gz, by three functions: read.csv in base R, read_csv in tidyverse, and fread in the data.table package.

Which function is fastest? Is there difference in the (default) parsed data types? How much memory does each resultant dataframe or tibble use? (Hint: system.time measures run times; pryr::object_size measures memory usage.)

Answer

First, we read the admissions.csv.gz file with read.csv, read_csv, and fread, and compare their speeds.

mimic_path <- "~/mimic/hosp/"
# reading time for read.csv
system.time(data_read.csv <- read.csv(
  str_c(mimic_path,"admissions.csv.gz")))
   user  system elapsed 
  3.157   0.047   3.204 
# reading time for read_csv
system.time(data_read_csv <- read_csv(
  str_c(mimic_path,"admissions.csv.gz")))
   user  system elapsed 
  0.955   0.072   0.541 
# reading time for fread
system.time(data_fread <- fread(
  str_c(mimic_path,"admissions.csv.gz")))
   user  system elapsed 
  0.397   0.033   0.436 
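
system.time reports a single run, which can be noisy. For a steadier comparison one could repeat the timings; below is a sketch using the bench package (an assumption: bench is installed, since it is not loaded above).

# check = FALSE because the three readers return different classes
# (data.frame, tibble, data.table)
bench::mark(
  base      = read.csv(str_c(mimic_path, "admissions.csv.gz")),
  tidyverse = read_csv(str_c(mimic_path, "admissions.csv.gz")),
  datatable = fread(str_c(mimic_path, "admissions.csv.gz")),
  iterations = 3, check = FALSE
)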

Then, we will compare the memory usage.

# memory usage for read.csv
pryr::object_size(data_read.csv)
158.71 MB
# memory usage for read_csv
pryr::object_size(data_read_csv)
55.31 MB
# memory usage for fread
pryr::object_size(data_fread)
50.13 MB

Finally, we will compare the data types.

# data types for read.csv
str(data_read.csv)
'data.frame':   431231 obs. of  16 variables:
 $ subject_id          : int  10000032 10000032 10000032 10000032 10000068 10000084 10000084 10000108 10000117 10000117 ...
 $ hadm_id             : int  22595853 22841357 25742920 29079034 25022803 23052089 29888819 27250926 22927623 27988844 ...
 $ admittime           : chr  "2180-05-06 22:23:00" "2180-06-26 18:27:00" "2180-08-05 23:44:00" "2180-07-23 12:35:00" ...
 $ dischtime           : chr  "2180-05-07 17:15:00" "2180-06-27 18:49:00" "2180-08-07 17:50:00" "2180-07-25 17:55:00" ...
 $ deathtime           : chr  "" "" "" "" ...
 $ admission_type      : chr  "URGENT" "EW EMER." "EW EMER." "EW EMER." ...
 $ admit_provider_id   : chr  "P874LG" "P09Q6Y" "P60CC5" "P30KEH" ...
 $ admission_location  : chr  "TRANSFER FROM HOSPITAL" "EMERGENCY ROOM" "EMERGENCY ROOM" "EMERGENCY ROOM" ...
 $ discharge_location  : chr  "HOME" "HOME" "HOSPICE" "HOME" ...
 $ insurance           : chr  "Other" "Medicaid" "Medicaid" "Medicaid" ...
 $ language            : chr  "ENGLISH" "ENGLISH" "ENGLISH" "ENGLISH" ...
 $ marital_status      : chr  "WIDOWED" "WIDOWED" "WIDOWED" "WIDOWED" ...
 $ race                : chr  "WHITE" "WHITE" "WHITE" "WHITE" ...
 $ edregtime           : chr  "2180-05-06 19:17:00" "2180-06-26 15:54:00" "2180-08-05 20:58:00" "2180-07-23 05:54:00" ...
 $ edouttime           : chr  "2180-05-06 23:30:00" "2180-06-26 21:31:00" "2180-08-06 01:44:00" "2180-07-23 14:00:00" ...
 $ hospital_expire_flag: int  0 0 0 0 0 0 0 0 0 0 ...
# data types for read_csv
str(data_read_csv)
spc_tbl_ [431,231 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ subject_id          : num [1:431231] 1e+07 1e+07 1e+07 1e+07 1e+07 ...
 $ hadm_id             : num [1:431231] 22595853 22841357 25742920 29079034 25022803 ...
 $ admittime           : POSIXct[1:431231], format: "2180-05-06 22:23:00" "2180-06-26 18:27:00" ...
 $ dischtime           : POSIXct[1:431231], format: "2180-05-07 17:15:00" "2180-06-27 18:49:00" ...
 $ deathtime           : POSIXct[1:431231], format: NA NA ...
 $ admission_type      : chr [1:431231] "URGENT" "EW EMER." "EW EMER." "EW EMER." ...
 $ admit_provider_id   : chr [1:431231] "P874LG" "P09Q6Y" "P60CC5" "P30KEH" ...
 $ admission_location  : chr [1:431231] "TRANSFER FROM HOSPITAL" "EMERGENCY ROOM" "EMERGENCY ROOM" "EMERGENCY ROOM" ...
 $ discharge_location  : chr [1:431231] "HOME" "HOME" "HOSPICE" "HOME" ...
 $ insurance           : chr [1:431231] "Other" "Medicaid" "Medicaid" "Medicaid" ...
 $ language            : chr [1:431231] "ENGLISH" "ENGLISH" "ENGLISH" "ENGLISH" ...
 $ marital_status      : chr [1:431231] "WIDOWED" "WIDOWED" "WIDOWED" "WIDOWED" ...
 $ race                : chr [1:431231] "WHITE" "WHITE" "WHITE" "WHITE" ...
 $ edregtime           : POSIXct[1:431231], format: "2180-05-06 19:17:00" "2180-06-26 15:54:00" ...
 $ edouttime           : POSIXct[1:431231], format: "2180-05-06 23:30:00" "2180-06-26 21:31:00" ...
 $ hospital_expire_flag: num [1:431231] 0 0 0 0 0 0 0 0 0 0 ...
 - attr(*, "spec")=
  .. cols(
  ..   subject_id = col_double(),
  ..   hadm_id = col_double(),
  ..   admittime = col_datetime(format = ""),
  ..   dischtime = col_datetime(format = ""),
  ..   deathtime = col_datetime(format = ""),
  ..   admission_type = col_character(),
  ..   admit_provider_id = col_character(),
  ..   admission_location = col_character(),
  ..   discharge_location = col_character(),
  ..   insurance = col_character(),
  ..   language = col_character(),
  ..   marital_status = col_character(),
  ..   race = col_character(),
  ..   edregtime = col_datetime(format = ""),
  ..   edouttime = col_datetime(format = ""),
  ..   hospital_expire_flag = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
# data types for fread
str(data_fread)
Classes 'data.table' and 'data.frame':  431231 obs. of  16 variables:
 $ subject_id          : int  10000032 10000032 10000032 10000032 10000068 10000084 10000084 10000108 10000117 10000117 ...
 $ hadm_id             : int  22595853 22841357 25742920 29079034 25022803 23052089 29888819 27250926 22927623 27988844 ...
 $ admittime           : POSIXct, format: "2180-05-06 22:23:00" "2180-06-26 18:27:00" ...
 $ dischtime           : POSIXct, format: "2180-05-07 17:15:00" "2180-06-27 18:49:00" ...
 $ deathtime           : POSIXct, format: NA NA ...
 $ admission_type      : chr  "URGENT" "EW EMER." "EW EMER." "EW EMER." ...
 $ admit_provider_id   : chr  "P874LG" "P09Q6Y" "P60CC5" "P30KEH" ...
 $ admission_location  : chr  "TRANSFER FROM HOSPITAL" "EMERGENCY ROOM" "EMERGENCY ROOM" "EMERGENCY ROOM" ...
 $ discharge_location  : chr  "HOME" "HOME" "HOSPICE" "HOME" ...
 $ insurance           : chr  "Other" "Medicaid" "Medicaid" "Medicaid" ...
 $ language            : chr  "ENGLISH" "ENGLISH" "ENGLISH" "ENGLISH" ...
 $ marital_status      : chr  "WIDOWED" "WIDOWED" "WIDOWED" "WIDOWED" ...
 $ race                : chr  "WHITE" "WHITE" "WHITE" "WHITE" ...
 $ edregtime           : POSIXct, format: "2180-05-06 19:17:00" "2180-06-26 15:54:00" ...
 $ edouttime           : POSIXct, format: "2180-05-06 23:30:00" "2180-06-26 21:31:00" ...
 $ hospital_expire_flag: int  0 0 0 0 0 0 0 0 0 0 ...
 - attr(*, ".internal.selfref")=<externalptr> 

In terms of speed, fread is the fastest, followed by read_csv; read.csv is the slowest (by both user and elapsed time).

In terms of the memory used by the resultant dataframe or tibble, fread uses the least, followed by read_csv; read.csv uses the most.

In terms of the default parsed data types, fread and read_csv are essentially the same once "double" and "integer" are both viewed as numeric; read.csv differs in that it parses the datetime columns (admittime, dischtime, etc.) as character rather than POSIXct.
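
For a side-by-side view, one can tabulate the first class of each parsed column; a quick sketch using the three objects created above:

# compare parsed column classes across the three readers
tibble(
  column   = names(data_read.csv),
  read.csv = vapply(data_read.csv, function(x) class(x)[1], character(1)),
  read_csv = vapply(data_read_csv, function(x) class(x)[1], character(1)),
  fread    = vapply(data_fread,    function(x) class(x)[1], character(1))
)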

Q1.2 User-supplied data types

Re-ingest admissions.csv.gz by indicating appropriate column data types in read_csv. Does the run time change? How much memory does the result tibble use? (Hint: col_types argument in read_csv.)

Answer

system.time(data_read_csv <- read_csv(
  str_c(mimic_path, "admissions.csv.gz"), 
  # n = number, T = datetime, f = factor, c = character
  col_types = "nnTTTfcffffffTTn"))
   user  system elapsed 
  0.825   0.063   0.367 
pryr::object_size(data_read_csv)
43.24 MB

Yes, the run time is shorter because read_csv no longer has to guess the column types from the data. The resultant tibble uses 43.24 MB, less than the 55.31 MB before, largely because the categorical columns are now stored as factors rather than character vectors.
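
For readability, the compact string above is equivalent to an explicit cols() specification (n = col_number, T = col_datetime, f = col_factor, c = col_character); a sketch:

read_csv(
  str_c(mimic_path, "admissions.csv.gz"),
  col_types = cols(
    subject_id = col_number(),
    hadm_id = col_number(),
    admittime = col_datetime(),
    dischtime = col_datetime(),
    deathtime = col_datetime(),
    admission_type = col_factor(),
    admit_provider_id = col_character(),
    admission_location = col_factor(),
    discharge_location = col_factor(),
    insurance = col_factor(),
    language = col_factor(),
    marital_status = col_factor(),
    race = col_factor(),
    edregtime = col_datetime(),
    edouttime = col_datetime(),
    hospital_expire_flag = col_number()
  ))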

Q2. Ingest big data files

Let us focus on a bigger file, labevents.csv.gz, which is about 125x bigger than admissions.csv.gz.

ls -l ~/mimic/hosp/labevents.csv.gz
-rw-rw-r--@ 1 jacenai  staff  1939088924 Jan  5  2023 /Users/jacenai/mimic/hosp/labevents.csv.gz

Display the first 10 lines of this file.

zcat < ~/mimic/hosp/labevents.csv.gz | head -10
labevent_id,subject_id,hadm_id,specimen_id,itemid,order_provider_id,charttime,storetime,value,valuenum,valueuom,ref_range_lower,ref_range_upper,flag,priority,comments
1,10000032,,45421181,51237,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,1.4,1.4,,0.9,1.1,abnormal,ROUTINE,
2,10000032,,45421181,51274,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,___,15.1,sec,9.4,12.5,abnormal,ROUTINE,VERIFIED.
3,10000032,,52958335,50853,P28Z0X,2180-03-23 11:51:00,2180-03-25 11:06:00,___,15,ng/mL,30,60,abnormal,ROUTINE,NEW ASSAY IN USE ___: DETECTS D2 AND D3 25-OH ACCURATELY.
4,10000032,,52958335,50861,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,102,102,IU/L,0,40,abnormal,ROUTINE,
5,10000032,,52958335,50862,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,3.3,3.3,g/dL,3.5,5.2,abnormal,ROUTINE,
6,10000032,,52958335,50863,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,109,109,IU/L,35,105,abnormal,ROUTINE,
7,10000032,,52958335,50864,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,___,8,ng/mL,0,8.7,,ROUTINE,MEASURED BY ___.
8,10000032,,52958335,50868,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,12,12,mEq/L,8,20,,ROUTINE,
9,10000032,,52958335,50878,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,143,143,IU/L,0,40,abnormal,ROUTINE,

Q2.1 Ingest labevents.csv.gz by read_csv

Try to ingest labevents.csv.gz using read_csv. What happens? If it takes more than 5 minutes on your computer, then abort the program and report your findings.

Answer

read_csv(str_c("~/mimic/hosp/","labevents.csv.gz"))

The read did not finish within 5 minutes, so I aborted the program. The file is large (1.94 GB compressed), so there is a great deal of data to decompress and parse, and holding the fully parsed table (over 118 million rows) requires more memory than my 8 GB machine can spare.
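
If one only needs a quick look at such a large file, a small probe avoids the full ingestion; a sketch using read_csv's n_max argument:

# read only the first 5 rows to inspect the structure without
# parsing the whole 1.94 GB file
read_csv(str_c("~/mimic/hosp/", "labevents.csv.gz"), n_max = 5)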

Q2.2 Ingest selected columns of labevents.csv.gz by read_csv

Try to ingest only columns subject_id, itemid, charttime, and valuenum in labevents.csv.gz using read_csv. Does this solve the ingestion issue? (Hint: col_select argument in read_csv.)

Answer

read_csv("~/mimic/hosp/labevents.csv.gz", 
         col_select = c("subject_id", 
                        "itemid", 
                        "charttime", 
                        "valuenum"))
# A tibble: 118,171,367 × 4
   subject_id itemid charttime           valuenum
        <dbl>  <dbl> <dttm>                 <dbl>
 1   10000032  51237 2180-03-23 11:51:00      1.4
 2   10000032  51274 2180-03-23 11:51:00     15.1
 3   10000032  50853 2180-03-23 11:51:00     15  
 4   10000032  50861 2180-03-23 11:51:00    102  
 5   10000032  50862 2180-03-23 11:51:00      3.3
 6   10000032  50863 2180-03-23 11:51:00    109  
 7   10000032  50864 2180-03-23 11:51:00      8  
 8   10000032  50868 2180-03-23 11:51:00     12  
 9   10000032  50878 2180-03-23 11:51:00    143  
10   10000032  50882 2180-03-23 11:51:00     27  
# ℹ 118,171,357 more rows

Yes, this solves the ingestion issue: with col_select, read_csv finishes and returns a tibble of 118,171,367 rows, since only four of the sixteen columns have to be stored in memory.

Q2.3 Ingest subset of labevents.csv.gz

Our first strategy to handle this big data file is to make a subset of the labevents data. Read the MIMIC documentation for the content in data file labevents.csv.

In later exercises, we will only be interested in the following lab items: creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), and glucose (50931) and the following columns: subject_id, itemid, charttime, valuenum. Write a Bash command to extract these columns and rows from labevents.csv.gz and save the result to a new file labevents_filtered.csv.gz in the current working directory. (Hint: use zcat < to pipe the output of labevents.csv.gz to awk and then to gzip to compress the output. To save render time, put #| eval: false at the beginning of this code chunk.)

Display the first 10 lines of the new file labevents_filtered.csv.gz. How many lines are in this new file? How long does it take read_csv to ingest labevents_filtered.csv.gz?

Answer

First, display the first 10 lines of labevents.csv.gz to confirm the structure of the file and the order of the columns.

zcat < ~/mimic/hosp/labevents.csv.gz | head

Then, extract the requested columns and rows from labevents.csv.gz and save the result to the new file labevents_filtered.csv.gz. From the header shown above, subject_id, itemid, charttime, and valuenum are fields 2, 5, 7, and 10.

zcat < ~/mimic/hosp/labevents.csv.gz | \
awk -F ',' 'BEGIN { OFS = "," } {
    # field 5 is itemid: keep only the eight lab items of interest,
    # printing subject_id, itemid, charttime, valuenum (fields 2, 5, 7, 10)
    if ($5 ~ /^(50912|50971|50983|50902|50882|51221|51301|50931)$/)
        print $2, $5, $7, $10
    }' | \
gzip > ~/mimic/hosp/labevents_filtered.csv.gz

Next, display the first 10 lines of the new file labevents_filtered.csv.gz. (The header row does not match the itemid filter, so it was dropped.)

zcat < ~/mimic/hosp/labevents_filtered.csv.gz | head -n 10
10000032,50882,2180-03-23 11:51:00,27
10000032,50902,2180-03-23 11:51:00,101
10000032,50912,2180-03-23 11:51:00,0.4
10000032,50971,2180-03-23 11:51:00,3.7
10000032,50983,2180-03-23 11:51:00,136
10000032,50931,2180-03-23 11:51:00,95
10000032,51221,2180-03-23 11:51:00,45.4
10000032,51301,2180-03-23 11:51:00,3
10000032,51221,2180-05-06 22:25:00,42.6
10000032,51301,2180-05-06 22:25:00,5

Next, count the number of lines in the new file labevents_filtered.csv.gz: there are 24855909 lines.

zcat < ~/mimic/hosp/labevents_filtered.csv.gz | wc -l
 24855909

Finally, measure how long read_csv takes to ingest labevents_filtered.csv.gz: about 25 seconds elapsed.

system.time({
  data <- read_csv("~/mimic/hosp/labevents_filtered.csv.gz")
})
   user  system elapsed 
 23.482   1.122  24.517 
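
Because the header row was dropped by the awk filter, read_csv by default treats the first data row as column names. A sketch that instead supplies the names explicitly (the four columns selected above):

data <- read_csv("~/mimic/hosp/labevents_filtered.csv.gz",
                 col_names = c("subject_id", "itemid",
                               "charttime", "valuenum"))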

Q2.4 Ingest labevents.csv by Apache Arrow

Our second strategy is to use Apache Arrow for larger-than-memory data analytics. Unfortunately Arrow does not work with gz files directly. First decompress labevents.csv.gz to labevents.csv and put it in the current working directory. To save render time, put #| eval: false at the beginning of this code chunk.

Then use arrow::open_dataset to ingest labevents.csv, select columns, and filter itemid as in Q2.3. How long does the ingest+select+filter process take? Display the number of rows and the first 10 rows of the result tibble, and make sure they match those in Q2.3. (Hint: use dplyr verbs for selecting columns and filtering rows.)

Write a few sentences to explain what is Apache Arrow. Imagine you want to explain it to a layman in an elevator.

Answer

First, decompress the labevents.csv.gz file (the -k flag keeps the compressed original).

system("gzip -d -k ~/mimic/hosp/labevents.csv.gz")

Then, use arrow::open_dataset to ingest labevents.csv, select columns, and filter itemid as in Q2.3.

file_path <- "~/mimic/hosp/labevents.csv"

start_time <- Sys.time() # Record the start time

labevents <- arrow::open_dataset(file_path, format = "csv") %>%
  filter(itemid %in% c(50912, 50971, 50983, 50902, 
                       50882, 51221, 51301, 50931)) %>%
  select(subject_id, itemid, charttime, valuenum)

end_time <- Sys.time()  # Record the end time

execution_time <- end_time - start_time
execution_time
Time difference of 0.08673501 secs

The ingest+select+filter step takes under 0.1 seconds, but note that arrow evaluates lazily: the pipeline above only builds a query, and the actual scan happens when the result is collected into a data frame below. The number of rows in the result tibble is 24855909, and the first 10 rows match those in Q2.3; the charttime values only appear shifted because arrow parses the timestamps as UTC while R prints them in the local time zone.

# Convert the Arrow object to a data frame
labevents_df_2.4 <- as.data.frame(labevents)
# Display the number of rows
cat("Number of rows:", nrow(labevents_df_2.4), "\n")
Number of rows: 24855909 
# Display the first 10 rows
head(labevents_df_2.4, 10)
   subject_id itemid           charttime valuenum
1    10000032  50882 2180-03-23 04:51:00     27.0
2    10000032  50902 2180-03-23 04:51:00    101.0
3    10000032  50912 2180-03-23 04:51:00      0.4
4    10000032  50971 2180-03-23 04:51:00      3.7
5    10000032  50983 2180-03-23 04:51:00    136.0
6    10000032  50931 2180-03-23 04:51:00     95.0
7    10000032  51221 2180-03-23 04:51:00     45.4
8    10000032  51301 2180-03-23 04:51:00      3.0
9    10000032  51221 2180-05-06 15:25:00     42.6
10   10000032  51301 2180-05-06 15:25:00      5.0
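
Since the sub-second timing above only measures building the lazy query, a fairer end-to-end timing would force the scan by collecting the result; a sketch (not run here):

# collect() materializes the query result in R, which is where the
# real ingest+select+filter cost is paid
system.time({
  labevents_collected <- arrow::open_dataset(file_path, format = "csv") %>%
    filter(itemid %in% c(50912, 50971, 50983, 50902, 
                         50882, 51221, 51301, 50931)) %>%
    select(subject_id, itemid, charttime, valuenum) %>%
    collect()
})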

Explanation of Apache Arrow: Apache Arrow is like a universal language for data that allows different software programs to communicate and share data more efficiently. It’s a technology that makes it easier for different systems to understand and work with large amounts of data, regardless of the programming language they are written in. So, it’s like a common ground where data can be exchanged and understood by various applications, making data analysis and processing faster and more seamless, just like how people from different countries can communicate more effectively using a common language.

Q2.5 Compress labevents.csv to Parquet format and ingest/select/filter

Re-write the csv file labevents.csv in the binary Parquet format (Hint: arrow::write_dataset.) How large is the Parquet file(s)? How long does the ingest+select+filter process of the Parquet file(s) take? Display the number of rows and the first 10 rows of the result tibble and make sure they match those in Q2.3. (Hint: use dplyr verbs for selecting columns and filtering rows.)

Write a few sentences to explain what is the Parquet format. Imagine you want to explain it to a layman in an elevator.

Answer

First, rewrite the csv file labevents.csv in the binary Parquet format.

# Define the file paths
parquet_file_path <- "~/mimic/hosp/labevents.parquet"

# Rewrite the CSV file in Parquet format
arrow::write_dataset(arrow::open_dataset(file_path, format = "csv"), 
                     parquet_file_path, format = "parquet")
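
To see how large the resulting Parquet file(s) are, one can sum the file sizes under the output directory; a sketch (the exact size depends on the arrow version and its compression defaults):

# write_dataset creates a directory (containing, e.g., part-0.parquet);
# sum the sizes of all files inside it, in GB
sum(file.info(list.files(parquet_file_path,
                         full.names = TRUE, recursive = TRUE))$size) / 1e9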

Then, measuring the time taken for ingest+select+filter process of the Parquet file.

start_time <- Sys.time() # Record the start time

parquet_dataset <- 
  arrow::open_dataset(parquet_file_path, format = "parquet") %>%
  filter(itemid %in% c(50912, 50971, 50983, 50902, 
                       50882, 51221, 51301, 50931)) %>%
  select(subject_id, itemid, charttime, valuenum)

end_time <- Sys.time()  # Record the end time
execution_time <- end_time - start_time
execution_time
Time difference of 0.143501 secs

The ingest+select+filter process of the Parquet file takes around 0.14 seconds (again, mostly the cost of building the lazy query). The number of rows in the result tibble is 24855909, and the contents of the first 10 rows match those in Q2.3, even though the row order may differ and the charttime values are again displayed in the local time zone.

labevents_df_2.5 <- as.data.frame(parquet_dataset)
# Display the number of rows
num_rows <- nrow(labevents_df_2.5)
cat("Number of rows:", num_rows, "\n")
Number of rows: 24855909 
# Display the first 10 rows
head(labevents_df_2.5, 10)
   subject_id itemid           charttime valuenum
1    10000032  50882 2180-03-23 04:51:00     27.0
2    10000032  50902 2180-03-23 04:51:00    101.0
3    10000032  50912 2180-03-23 04:51:00      0.4
4    10000032  50971 2180-03-23 04:51:00      3.7
5    10000032  50983 2180-03-23 04:51:00    136.0
6    10000032  50931 2180-03-23 04:51:00     95.0
7    10000032  51221 2180-03-23 04:51:00     45.4
8    10000032  51301 2180-03-23 04:51:00      3.0
9    10000032  51221 2180-05-06 15:25:00     42.6
10   10000032  51301 2180-05-06 15:25:00      5.0

Explanation of Parquet format: Parquet is a file format for storing data that is designed to be efficient and optimized for use with big data processing frameworks. It’s like a special type of container that organizes data in a highly compressed and columnar format, making it easier for computers to read and process large amounts of data quickly. Think of it as a way to pack and organize data in a smart and efficient way, similar to how a well-organized filing cabinet makes it easy to find and access documents quickly.

Q2.6 DuckDB

Ingest the Parquet file, convert it to a DuckDB table by arrow::to_duckdb, select columns, and filter rows as in Q2.5. How long does the ingest+convert+select+filter process take? Display the number of rows and the first 10 rows of the result tibble and make sure they match those in Q2.3. (Hint: use dplyr verbs for selecting columns and filtering rows.)

Write a few sentences to explain what is DuckDB. Imagine you want to explain it to a layman in an elevator.

Answer

First, ingest the Parquet file and convert it to a DuckDB table.

parquet_file_path_1 <- "~/mimic/hosp/labevents.parquet/part-0.parquet"

start_time <- Sys.time() # Record the start time

# Ingest the Parquet file and convert it to a DuckDB table
parquet_dataset <- arrow::read_parquet(parquet_file_path_1)
duckdb_table <- arrow::to_duckdb(parquet_dataset)

# Select specific columns and filter rows based on itemid
result <- duckdb_table %>%
  dplyr::filter(itemid %in% c(50912, 50971, 50983, 50902, 
                              50882, 51221, 51301, 50931)) %>%
  dplyr::select(subject_id, itemid, charttime, valuenum)

end_time <- Sys.time() # Record the end time
execution_time <- end_time - start_time
execution_time
Time difference of 10.04156 mins

The ingest+convert+select+filter process takes around 10 minutes, most of it presumably spent by arrow::read_parquet materializing the entire table in memory before it is handed to DuckDB. The number of rows in the result tibble is 24855909. The contents of the first 10 rows match those in Q2.3, even though the order of the rows may be different.

labevents_df_2.6 <- as.data.frame(result)
# Display the number of rows
num_rows <- nrow(labevents_df_2.6)
cat("Number of rows:", num_rows, "\n")
Number of rows: 24855909 
# Display the first 10 rows
head(labevents_df_2.6, 10)
   subject_id itemid           charttime valuenum
1    10000032  50882 2180-03-23 11:51:00     27.0
2    10000032  50902 2180-03-23 11:51:00    101.0
3    10000032  50912 2180-03-23 11:51:00      0.4
4    10000032  50971 2180-03-23 11:51:00      3.7
5    10000032  50983 2180-03-23 11:51:00    136.0
6    10000032  50931 2180-03-23 11:51:00     95.0
7    10000032  51221 2180-03-23 11:51:00     45.4
8    10000032  51301 2180-03-23 11:51:00      3.0
9    10000032  51221 2180-05-06 22:25:00     42.6
10   10000032  51301 2180-05-06 22:25:00      5.0
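
Most of the 10 minutes above is spent materializing the full table in R. A likely much faster variant keeps the data on disk and lets DuckDB stream it through a lazy Arrow dataset; a sketch (not timed here):

# open the Parquet directory lazily and register it with DuckDB,
# avoiding a full in-memory copy of the table
result_lazy <- 
  arrow::open_dataset(parquet_file_path, format = "parquet") %>%
  arrow::to_duckdb() %>%
  dplyr::filter(itemid %in% c(50912, 50971, 50983, 50902, 
                              50882, 51221, 51301, 50931)) %>%
  dplyr::select(subject_id, itemid, charttime, valuenum)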

Explanation of DuckDB: DuckDB is like a digital filing cabinet for data. It’s a special kind of software that helps organize and store large amounts of data in a way that makes it easy to find and work with. Think of it as a super-efficient librarian that keeps all your data neatly organized and quickly accessible. It’s especially useful for tasks like analyzing data, running calculations, and generating reports, making it a valuable tool for businesses and researchers who deal with lots of data.

Q3. Ingest and filter chartevents.csv.gz

chartevents.csv.gz contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The itemid variable indicates a single measurement type in the database. The value variable is the value measured for itemid. The first 10 lines of chartevents.csv.gz are

zcat < ~/mimic/icu/chartevents.csv.gz | head -10
subject_id,hadm_id,stay_id,caregiver_id,charttime,storetime,itemid,value,valuenum,valueuom,warning
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220179,82,82,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220180,59,59,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220181,63,63,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220045,94,94,bpm,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220179,85,85,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220180,55,55,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220181,62,62,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220210,20,20,insp/min,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220277,95,95,%,0

d_items.csv.gz is the dictionary for the itemid in chartevents.csv.gz.

zcat < ~/mimic/icu/d_items.csv.gz | head -10
itemid,label,abbreviation,linksto,category,unitname,param_type,lownormalvalue,highnormalvalue
220001,Problem List,Problem List,chartevents,General,,Text,,
220003,ICU Admission date,ICU Admission date,datetimeevents,ADT,,Date and time,,
220045,Heart Rate,HR,chartevents,Routine Vital Signs,bpm,Numeric,,
220046,Heart rate Alarm - High,HR Alarm - High,chartevents,Alarms,bpm,Numeric,,
220047,Heart Rate Alarm - Low,HR Alarm - Low,chartevents,Alarms,bpm,Numeric,,
220048,Heart Rhythm,Heart Rhythm,chartevents,Routine Vital Signs,,Text,,
220050,Arterial Blood Pressure systolic,ABPs,chartevents,Routine Vital Signs,mmHg,Numeric,90,140
220051,Arterial Blood Pressure diastolic,ABPd,chartevents,Routine Vital Signs,mmHg,Numeric,60,90
220052,Arterial Blood Pressure mean,ABPm,chartevents,Routine Vital Signs,mmHg,Numeric,,

In later exercises, we are interested in the vitals for ICU patients: heart rate (220045), mean non-invasive blood pressure (220181), systolic non-invasive blood pressure (220179), body temperature in Fahrenheit (223761), and respiratory rate (220210). Retrieve a subset of chartevents.csv.gz only containing these items, using the favorite method you learnt in Q2.

Document the steps and show code. Display the number of rows and the first 10 rows of the result tibble.

Answer

First, decompress the file chartevents.csv.gz.

# Decompress, keeping the compressed original (-k)
system("gzip -d -k ~/mimic/icu/chartevents.csv.gz")

Then, ingest the CSV file with arrow::open_dataset and filter the rows to the five vital-sign itemids (all columns are kept).

file_path_3 <- "~/mimic/icu/chartevents.csv"

chartevents <- arrow::open_dataset(file_path_3, format = "csv") %>%
  filter(itemid %in% c(220045, 220181, 220179, 223761, 
                       220210)) 

Finally, display the number of rows and the first 10 rows of the result tibble. There are 22502319 rows in the result tibble (the charttime values are again displayed in the local time zone, as in Q2.4).

# Convert the Arrow object to a data frame
chartevents_df_3 <- as.data.frame(chartevents)
# Display the number of rows
cat("Number of rows:", nrow(chartevents_df_3), "\n")
Number of rows: 22502319 
# Display the first 10 rows
head(chartevents_df_3, 10)
   subject_id  hadm_id  stay_id caregiver_id           charttime
1    10000032 29079034 39553978        47007 2180-07-23 14:01:00
2    10000032 29079034 39553978        47007 2180-07-23 14:01:00
3    10000032 29079034 39553978        47007 2180-07-23 15:00:00
4    10000032 29079034 39553978        47007 2180-07-23 15:00:00
5    10000032 29079034 39553978        47007 2180-07-23 15:00:00
6    10000032 29079034 39553978        47007 2180-07-23 15:00:00
7    10000032 29079034 39553978        66056 2180-07-23 12:00:00
8    10000032 29079034 39553978        66056 2180-07-23 12:00:00
9    10000032 29079034 39553978        66056 2180-07-23 12:00:00
10   10000032 29079034 39553978        66056 2180-07-23 12:00:00
             storetime itemid value valuenum valueuom warning
1  2180-07-23 15:15:00 220179    82       82     mmHg       0
2  2180-07-23 15:15:00 220181    63       63     mmHg       0
3  2180-07-23 15:15:00 220045    94       94      bpm       0
4  2180-07-23 15:15:00 220179    85       85     mmHg       0
5  2180-07-23 15:15:00 220181    62       62     mmHg       0
6  2180-07-23 15:15:00 220210    20       20 insp/min       0
7  2180-07-23 12:59:00 220045    97       97      bpm       0
8  2180-07-23 12:59:00 220179    93       93     mmHg       0
9  2180-07-23 12:59:00 220181    56       56     mmHg       0
10 2180-07-23 12:59:00 220210    16       16 insp/min       0
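
If this subset will be queried repeatedly, it may pay to convert chartevents.csv to Parquet once, as in Q2.5, and query that instead; a sketch (the chartevents.parquet path is hypothetical):

# one-time conversion to Parquet, then fast lazy queries against it
arrow::write_dataset(arrow::open_dataset(file_path_3, format = "csv"),
                     "~/mimic/icu/chartevents.parquet", format = "parquet")
chartevents_pq <- 
  arrow::open_dataset("~/mimic/icu/chartevents.parquet") %>%
  filter(itemid %in% c(220045, 220181, 220179, 223761, 220210))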